Abstract. Randomized matrix sparsification has proven to be a fruitful technique for producing
faster algorithms in applications ranging from graph partitioning to semidefinite programming. In
the decade or so of research into this technique, the focus has been, with few exceptions, on
ensuring the quality of approximation in the spectral and Frobenius norms. For certain graph
algorithms, however, the $\infty\to1$ norm may be a more natural measure of performance.

This paper addresses the problem of approximating a real matrix A by a sparse random matrix 
X with respect to several norms. It provides the first results on approximation error in the $\infty\to1$
and $\infty\to2$ norms, and it uses a result of Latala to study approximation error in the spectral norm.
These bounds hold for a reasonable family of random sparsification schemes, those which ensure that
the entries of X are independent and average to the corresponding entries of A. Optimality of the
$\infty\to1$ and $\infty\to2$ error estimates is established. Concentration results for the three norms hold when
the entries of X are uniformly bounded. The spectral error bound is used to predict the performance 
of several sparsification and quantization schemes that have appeared in the literature; the results 
are competitive with the performance guarantees given by earlier scheme-specific analyses. 



1. Introduction 

Massive datasets are ubiquitous in modern data processing. Classical dense matrix algorithms 
are poorly suited to such problems because their running times scale superlinearly with the size 
of the matrix. When the dataset is sparse, one prefers to use sparse matrix algorithms, whose 
running times depend more on the sparsity of the matrix than on the size of the matrix. Of course, 
in many applications the matrix is not sparse. Accordingly, one may wonder whether it is possible 
to approximate a computation on a large dense matrix with a related computation on a sparse 
approximant to the matrix. 

Let $\|\cdot\|$ be a norm on matrices. We may phrase the following question: Given a matrix A, how
can one efficiently generate a sparse matrix X for which the approximation error $\|A - X\|$ is small?

In the seminal papers [AM07, AM01], Achlioptas and McSherry demonstrate that one can bound, 
a priori, the spectral and Frobenius norm errors incurred when using a particular random approx- 
imation scheme. They use this scheme as the basis of an efficient algorithm for calculating near 
optimal low-rank approximations to large matrices. In a related work [AHK06], Arora, Hazan, 
and Kale present another randomized scheme for computing sparse approximants with controlled 
Frobenius and spectral norm errors. 

The literature has concentrated on the behavior of the approximation error in the spectral and 
Frobenius norms; however, these norms are not always the most natural choice. For instance, the 
problem of graph sparsification is naturally posed as a question of preserving the cut-norm of a 
graph Laplacian. The strong equivalency of the cut-norm and the $\infty\to1$ norm suggests that, for
graph-theoretic applications, it may be fruitful to consider the behavior of the $\infty\to1$ norm under
sparsification. In other applications, e.g., the column subset selection algorithm in [Tro09], the
$\infty\to2$ norm is the norm of interest.

This paper investigates the errors incurred by approximating a fixed real matrix with a random 
matrix. Our results apply to any scheme in which the entries of the approximating matrix are 
independent and average to the corresponding entries of the fixed matrix. Our main contribution is 
a bound on the expected $\infty\to p$ norm error, which we specialize to the case of the $\infty\to1$ and $\infty\to2$
norms. We also use a result of Latala [Lat05] to find an optimal bound on the expected spectral
approximation error, and we establish the subgaussianity of the spectral approximation error.

1.1. Summary of results. Consider the matrix A as an operator from $\ell_\infty^n$ to $\ell_p^m$, where
$1 \le p \le \infty$. The operator norm of A is defined as

\[ \|A\|_{\infty\to p} = \max_{u \ne 0} \frac{\|Au\|_p}{\|u\|_\infty}. \]

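The major novel contributions of this paper are estimates of the $\infty\to1$ and $\infty\to2$ norm
approximation errors induced by random matrix sparsification schemes. Let A be a target matrix.
Suppose that X is a random matrix with independent entries that satisfies $\mathbb{E}X = A$. Then

\[ \mathbb{E}\|A - X\|_{\infty\to1} \le 2\Bigl[\sum_k \Bigl(\sum_j \operatorname{Var}(X_{jk})\Bigr)^{1/2} + \sum_j \Bigl(\sum_k \operatorname{Var}(X_{jk})\Bigr)^{1/2}\Bigr] \]

and

\[ \mathbb{E}\|A - X\|_{\infty\to2} \le 2\Bigl(\sum_{jk} \operatorname{Var}(X_{jk})\Bigr)^{1/2} + 2\sqrt{m}\,\min_D \max_j \Bigl(\sum_k \frac{\operatorname{Var}(X_{jk})}{d_k^2}\Bigr)^{1/2}, \]

where $D = \operatorname{diag}(d_1, \dots, d_n)$ is a positive diagonal matrix with $\operatorname{trace}(D^2) = 1$ and m is the
number of rows of A. We complement these estimates with tail bounds for the approximation error
in each norm.

We also note that

\[ \mathbb{E}\|A - X\| \le C\Bigl[\max_j \Bigl(\sum_k \operatorname{Var}(X_{jk})\Bigr)^{1/2} + \max_k \Bigl(\sum_j \operatorname{Var}(X_{jk})\Bigr)^{1/2} + \Bigl(\sum_{jk} \mathbb{E}(X_{jk} - a_{jk})^4\Bigr)^{1/4}\Bigr], \]

where C is a universal constant. This estimate follows from Latala's work [Lat05]. We use our
spectral norm results to analyze the sparsification and quantization schemes in [AHK06] and [AM07],
and we show that our analysis yields error estimates competitive with those established in the
respective works.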

While on the path to proving the $\infty\to1$ and $\infty\to2$ norm approximation error estimates, we derive
analogous relations of independent interest that bound the expected norm of a random matrix Z
with independent, zero-mean entries:

\[ \mathbb{E}\|Z\|_{\infty\to1} \le 2\,\mathbb{E}\bigl(\|Z\|_{\mathrm{col}} + \|Z^T\|_{\mathrm{col}}\bigr), \]

where $\|A\|_{\mathrm{col}}$ is the sum of the $\ell_2$ norms of the columns of A, and

\[ \mathbb{E}\|Z\|_{\infty\to2} \le 2\,\mathbb{E}\|Z\|_F + 2\min_D \mathbb{E}\|ZD^{-1}\|_{2\to\infty}, \]

where D is a diagonal positive matrix with $\operatorname{trace}(D^2) = 1$. More generally,

\[ \mathbb{E}\|Z\|_{\infty\to p} \le 2\,\mathbb{E}\Bigl\|\sum_k \varepsilon_k z_k\Bigr\|_p + 2\max_{\|u\|_q = 1} \mathbb{E}\sum_k \Bigl|\sum_j \varepsilon_j Z_{jk} u_j\Bigr|, \]

where q is the conjugate exponent to p and $z_k$ is the kth column of Z. Here, $\{\varepsilon_k\}$ is a sequence of
independent random variables uniform on $\{\pm1\}$.

1.2. Outline. Section 2 offers an overview of the related strands of research, and Section 3 introduces
the reader to our notations. In Section 4 we establish the foundations for our $\infty\to1$ and
$\infty\to2$ error estimates: an estimate of the expected $\infty\to p$ norm of a random matrix with independent,
zero-mean entries and a tail bound for this quantity. In Sections 5 and 6 these estimates
are specialized to the cases p = 1 and p = 2, respectively, and we establish the optimality of the
resulting expressions. The bounds are provided in both their most generic forms and ones more
suitable for applications. In Section 7 we use a result due to Latala [Lat05] to find an optimal
estimate of the spectral approximation error. A deviation bound is provided for the spectral norm
which captures the correct (subgaussian) tail behavior. We show that our estimates applied to the
sparsification schemes in [AHK06] and [AM07] recover performance guarantees comparable with
those obtained from the original analyses, under slightly weaker hypotheses.

2. Related Work 

In recent years, much attention has been paid to the problem of using sampling methods to 
approximate linear algebra computations efficiently. Such algorithms find applications in areas like
data mining, computational biology, and other areas where the relevant matrices may be too large 
to fit in RAM, or large enough that the computational requirements of standard algorithms become 
prohibitive. Here we review the streams of literature that have motivated and influenced this work. 

2.1. Randomized sparsification. The seminal research of Frieze, Kannan, and Vempala [FKV98,
FKV04] on using random sampling to decrease the running time of linear algebra algorithms focuses
on approximations to A constructed from random subsets of its rows. The subsequent influential
works of Drineas, Kannan, and Mahoney [DKM06a, DKM06b, DKM06c] also analyze the performance
of Monte Carlo algorithms which sample from the columns or rows of A.

In [AM01], Achlioptas and McSherry advance a different approach: instead of using low-rank
approximants, they sample the entries of A to produce a sparse matrix X that has a spectral
decomposition close to that of A. This transition from row/column sampling to independent
sampling of the entries allows them to bring to bear powerful techniques from random matrix
theory. In [AM01], their main tool is a result on the concentration of the spectral norm of a
random matrix with independent, zero-mean, bounded entries. In a follow-up paper, [AM07],
they obtain better estimates by using a sharper concentration result derived from Talagrand's 
inequality. They point out the importance of using schemes which sparsify a given entry in A with 
a probability proportional to its magnitude. Such adaptive sparsification schemes keep the variance 
of the individual entries small, which tends to keep the error in approximating A with X small. 
Their scheme requires either two passes through the matrix or prior knowledge of an upper bound 
for the largest magnitude in A. 

In [AHK06], Arora, Hazan, and Kale describe a random sparsification algorithm which partially
quantizes its inputs and requires only one pass through the matrix. They use an epsilon-net 
argument and Chernoff bounds to establish that with high probability the resulting approximant 
has small error and high sparsity. 

2.2. Probability in Banach spaces. In [RV07], Rudelson and Vershynin take a different ap- 
proach to the Monte Carlo methodology for low-rank approximation. They consider A as a linear 
operator between finite-dimensional Banach spaces and apply techniques of probability in Banach 
spaces: decoupling, symmetrization, Slepian's lemma for Rademacher random variables, and a law 
of large numbers for operator-valued random variables. They show that, if A can be approximated 
by any rank-r matrix, then it is possible to obtain an accurate rank-k approximation to A by
sampling $O(r \log r)$ rows of A. Additionally, they quantify the behavior of the $\infty\to1$ and $\infty\to2$
norms of random submatrices.

Our methods are similar to those of Rudelson and Vershynin in [RV07] in that we consider A 
as a linear operator between finite-dimensional Banach spaces and use some of the same tools 
of probability in Banach spaces. Whereas Rudelson and Vershynin consider the behavior of the 
norms of random submatrices of A, we consider the behavior of the norms of matrices formed 
by randomly sparsifying (or quantizing) the entries of A. This yields error bounds applicable to 
schemes that sparsify or quantize matrices entrywise. Since some graph algorithms depend more 
on the number of edges in the graph than the number of vertices, such schemes may be useful in 
developing algorithms for handling large graphs. 

2.3. Random sampling of graphs. The bulk of the literature has focused on the behavior of
the spectral and Frobenius norms under randomized sparsification, but the $\infty\to1$ norm occurs
naturally in connection with graph theory. Let us review the relevant ideas.

Consider a weighted simple graph $G = (V, E, \omega)$ with adjacency matrix A given by

\[ a_{jk} = \begin{cases} \omega_{jk}, & (j,k) \in E, \\ 0, & \text{otherwise.} \end{cases} \]

A cut is a partition of the vertices into two blocks: $V = S \cup \bar{S}$. The cost of a cut is the sum of the
weights of all edges in E which have one vertex in S and one vertex in $\bar{S}$. Several problems relating
to cuts are of considerable practical interest. In particular, the maxcut problem, to determine 
the cut of maximum cost in a graph, is common in computer science applications. The cuts of 
maximum cost are exactly those which realize the cut-norm of the adjacency matrix, which is 
defined as 



\[ \|A\|_C = \max_{S \subset V} \Bigl|\sum_{(j,k) \in E} a_{jk}\,(\mathbf{1}_S)_j\,(\mathbf{1}_{\bar{S}})_k\Bigr|, \]

where $\mathbf{1}_S$ is the indicator vector for S. Finding the cut-norm of a general matrix is NP-hard, but
in [AN04], the authors offer a randomized polynomial-time algorithm which finds a submatrix $\tilde{A}$
of A such that $\bigl|\sum_{jk} \tilde{a}_{jk}\bigr| \ge 0.56\,\|A\|_C$. One crucial point in the derivation of the algorithm is the
fact that the $\infty\to1$ norm is strongly equivalent with the cut-norm:

\[ \|A\|_C \le \|A\|_{\infty\to1} \le 4\,\|A\|_C. \]
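
The equivalence above can be checked numerically on small matrices. The sketch below is illustrative only: it assumes the general cut-norm used by Alon and Naor (a maximum over a row subset S and a column subset T, which may differ slightly from the graph-cut form written above), and the helper names `cut_norm` and `inf_to_1_norm` are hypothetical. Both functions enumerate exponentially many candidates, so they are suitable only for tiny inputs.

```python
import itertools
import numpy as np

def cut_norm(A):
    """Brute-force max over S, T of |sum_{j in S, k in T} a_jk| (assumed definition)."""
    A = np.asarray(A, dtype=float)
    best = 0.0
    for s in itertools.product((0.0, 1.0), repeat=A.shape[0]):
        col_sums = np.array(s) @ A  # per-column sum of the rows selected by S
        # For a fixed S, the best T keeps all positive or all negative column sums.
        best = max(best, col_sums[col_sums > 0].sum(), -col_sums[col_sums < 0].sum())
    return best

def inf_to_1_norm(A):
    """Brute-force max over sign vectors x of ||A x||_1."""
    A = np.asarray(A, dtype=float)
    return max(np.abs(A @ np.array(x)).sum()
               for x in itertools.product((-1.0, 1.0), repeat=A.shape[1]))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 6))
c, o = cut_norm(A), inf_to_1_norm(A)   # expect c <= o <= 4 * c

# A 2 x 2 example in which the upper constant 4 is attained.
B = np.array([[1.0, -1.0], [-1.0, 1.0]])
c_b, o_b = cut_norm(B), inf_to_1_norm(B)
```

For the matrix `B`, the cut-norm is 1 while the $\infty\to1$ norm is 4, so the factor 4 in the upper bound cannot be improved in general.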

In his thesis [Kar95] and the sequence of papers [Kar94a, Kar94b, Kar96], Karger introduces
the idea of random sampling to increase the efficiency of calculations with graphs, with a focus
on cuts. In [Kar96], he shows that by picking each edge of the graph with a probability inversely
proportional to the density of edges in a neighborhood of that edge, one can construct a sparsifier,
i.e., a graph with the same vertex set and significantly fewer edges that preserves the value of each
cut to within a factor of $(1 \pm \epsilon)$.

In [SS08], Spielman and Srivastava improve upon this sampling scheme, instead keeping an edge
with probability proportional to its effective resistance, a measure of how likely it is to appear
in a random spanning tree of the graph. They provide an algorithm which produces a sparsifier
with $O((n \log n)/\epsilon^2)$ edges, where n is the number of vertices in the graph. They obtain this result
by reducing the problem to the behavior of projection matrices $\Pi_G$ and $\Pi_{G'}$ associated with the
original graph and the sparsifier, and appealing to a spectral norm concentration result.

The $\log n$ factor in [SS08] seems to be an unavoidable consequence of using spectral norm
concentration. In [BSS09], Batson et al. prove that the $\log n$ factor is not intrinsic: they establish
that every graph has a sparsifier that has $O(n)$ edges. The proof is constructive and provides a
deterministic algorithm for constructing such optimal sparsifiers in $O(n^3 m)$ time, where m is the
number of edges in the original graph.

The algorithm of [BSS09] is clearly not suitable for sparsifying graphs with a large number of 
vertices. Part of our motivation for investigating the $\infty\to1$ approximation error is the belief that
the equivalence of the cut-norm with the $\infty\to1$ norm means that matrix sparsification in the $\infty\to1$
norm might be useful for efficiently constructing optimal sparsifiers for such graphs.

3. Background 

We establish the notation used in the sequel and introduce our key technical tools. 
All quantities are real. The kth column of the matrix A is denoted by $a_k$ and the entries are
denoted $a_{jk}$.

For $1 \le p \le \infty$, the $\ell_p$ norm of x is written as $\|x\|_p$. We treat A as an operator from $\ell_p^n$ to $\ell_q^m$,
and the $p \to q$ operator norm of A is written as $\|A\|_{p\to q}$. The $p \to q$ and $q' \to p'$ norms are dual
in the sense that

\[ \|A\|_{p\to q} = \|A^T\|_{q'\to p'}. \]



This paper is concerned primarily with the spectral norm and the $\infty\to1$ and $\infty\to2$ norms. The
spectral norm $\|A\|$ is the largest singular value of A. The $\infty\to1$ and $\infty\to2$ norms do not have such
nice interpretations, and they are NP-hard to compute for general matrices [Roh00]. We remark
that $\|A\|_{\infty\to1} = \|Ax\|_1$ and $\|A\|_{\infty\to2} = \|Ay\|_2$ for certain vectors x and y whose components take
values $\pm1$.
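
Because the maximum is attained at a sign vector, these norms can be computed exactly for very small matrices by enumeration. The following is a minimal sketch under that observation; the function name `inf_to_p_norm` is a hypothetical helper, and the cost grows as $2^n$ in the number of columns, consistent with the NP-hardness noted above.

```python
import itertools
import numpy as np

def inf_to_p_norm(A, p):
    """Brute-force ||A||_{inf->p}: maximize ||A x||_p over all sign vectors x.

    x -> ||A x||_p is convex, so its maximum over the cube [-1, 1]^n is
    attained at a vertex; enumerating the 2^n vertices is therefore exact.
    """
    A = np.asarray(A, dtype=float)
    best = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=A.shape[1]):
        best = max(best, np.linalg.norm(A @ np.array(signs), ord=p))
    return best

A = np.array([[1.0, -2.0], [3.0, 4.0]])
norm_inf_1 = inf_to_p_norm(A, 1)   # x = (1, 1) gives A x = (-1, 7), l1 norm 8
norm_inf_2 = inf_to_p_norm(A, 2)   # same x gives l2 norm sqrt(50)
```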

An additional operator norm, the $2\to\infty$ norm, is also of interest: it is the largest $\ell_2$ norm
achieved by a row of A. We encounter two norms in the sequel that are not operator norms: the
Frobenius norm, denoted by $\|A\|_F$, and the column norm

\[ \|A\|_{\mathrm{col}} = \sum_k \Bigl(\sum_j a_{jk}^2\Bigr)^{1/2}. \]

The expectation of a random variable X is written $\mathbb{E}X$ and its variance is written
$\operatorname{Var}(X) = \mathbb{E}(X - \mathbb{E}X)^2$. The expectation taken with respect to one variable X, with all others fixed, is written
$\mathbb{E}_X$. The $L_q$ norm of X is denoted by $\mathbb{E}^q(X) = (\mathbb{E}|X|^q)^{1/q}$.

The expression X ~ Y indicates the random variables X and Y are identically distributed. 
Given a random variable X, the symbol X' denotes a random variable independent of X such that 
$X' \sim X$. The indicator variable of the event $X > Y$ is written $\mathbb{1}_{X>Y}$.

The Bernoulli distribution with expectation p is written Bern(p) and the Binomial distribution 
of n independent trials each with success probability p is written Bin(n,p). We write X ~ Bern(p) 
to indicate X is Bernoulli. 

A Rademacher random variable takes on the values ±1 with equal probability. A vector whose 
components are independent Rademacher variables is called a Rademacher vector. A real sum 
whose terms are weighted by independent Rademacher variables is called a Rademacher sum. The 
Khintchine inequality [Sza76] gives information on the moments of a Rademacher sum; in particular,
it tells us the expected magnitude of the sum is equivalent with the $\ell_2$ norm of the vector x:

Proposition 1 (Khintchine inequality). Let x be a real vector, and let $\varepsilon$ be a Rademacher vector.
Then

\[ \frac{1}{\sqrt{2}}\,\|x\|_2 \le \mathbb{E}\Bigl|\sum_k \varepsilon_k x_k\Bigr| \le \|x\|_2. \]
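
For short vectors the expectation in Proposition 1 can be evaluated exactly by averaging over all sign patterns. The sketch below does so; `expected_abs_rademacher_sum` is a hypothetical helper name, and the two test vectors are chosen only for illustration.

```python
import itertools
import numpy as np

def expected_abs_rademacher_sum(x):
    """Exact E|sum_k eps_k x_k|, averaging over all 2^n sign patterns."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=len(x)):
        total += abs(np.dot(signs, x))
    return total / 2 ** len(x)

# x = (1, 1): the four patterns give |2|, |0|, |0|, |2|, so the mean is 1,
# which equals ||x||_2 / sqrt(2); the lower constant is attained here.
mean_11 = expected_abs_rademacher_sum([1.0, 1.0])

# x = (3, 4): the patterns give |7|, |1|, |1|, |7|, so the mean is 4,
# which lies between 5 / sqrt(2) and ||x||_2 = 5.
mean_34 = expected_abs_rademacher_sum([3.0, 4.0])
```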



4. The $\infty\to p$ norm of a Random Matrix

We are interested in schemes that approximate a given matrix A by means of a random matrix 
X in such a way that the entries of X are independent and EX = A. It follows that the error 
matrix $Z = A - X$ has independent, zero-mean entries. Our intellectual concern is the class of
sparse random matrices, but this property does not play a role at this stage of the analysis.

In this section, we derive a bound on the expected value of the $\infty\to p$ norm of a random matrix
with independent, zero-mean entries. We also study the tails of this error. In the next two sections,
we use the results of this section to reach more detailed conclusions on the $\infty\to1$ and $\infty\to2$ norms
of Z.



4.1. Expected $\infty\to p$ norm. The main tool used to derive the bound on the expected norm of
Z is the following symmetrization argument [vW96, Lemma 2.3.1 et seq.].

Proposition 2. Let $Z_1, \dots, Z_n, Z_1', \dots, Z_n'$ be independent random variables satisfying $Z_i \sim Z_i'$,
and let $\varepsilon$ be a Rademacher vector. Let $\mathcal{F}$ be a family of functions such that

\[ \sup_{f \in \mathcal{F}} \sum_{k=1}^n \bigl(f(Z_k) - f(Z_k')\bigr) \]

is measurable. Then

\[ \mathbb{E}\sup_{f \in \mathcal{F}} \sum_{k=1}^n \bigl(f(Z_k) - f(Z_k')\bigr) = \mathbb{E}\sup_{f \in \mathcal{F}} \sum_{k=1}^n \varepsilon_k \bigl(f(Z_k) - f(Z_k')\bigr). \]

Since we work with finite-dimensional probability models and linear functions, measurability 
concerns can be ignored. 

The other critical tool is a version of Talagrand's Rademacher comparison theorem [LT91, Theorem
4.12 et seq.].

Proposition 3. Fix finite-dimensional vectors $z_1, \dots, z_n$ and let $\varepsilon$ be a Rademacher vector. Then

\[ \mathbb{E}\max_{\|u\|_q = 1} \sum_{k=1}^n \varepsilon_k \bigl|\langle z_k, u\rangle\bigr| \le \mathbb{E}\max_{\|u\|_q = 1} \sum_{k=1}^n \varepsilon_k \langle z_k, u\rangle. \]



Now we state and prove the bound on the expected norm of Z. 

Theorem 1. Let Z be a random matrix with independent, zero-mean entries and let $\varepsilon$ be a
Rademacher vector independent of Z. Then

\[ \mathbb{E}\|Z\|_{\infty\to p} \le 2\,\mathbb{E}\Bigl\|\sum_k \varepsilon_k z_k\Bigr\|_p + 2\max_{\|u\|_q = 1} \mathbb{E}\sum_k \Bigl|\sum_j \varepsilon_j Z_{jk} u_j\Bigr|, \]

where q is the conjugate exponent of p.
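Proof of Theorem 1. By duality,

\[ \mathbb{E}\|Z\|_{\infty\to p} = \mathbb{E}\|Z^T\|_{q\to1} = \mathbb{E}\max_{\|u\|_q = 1} \sum_k \bigl|\langle z_k, u\rangle\bigr|. \]

Center the terms in the sum and apply subadditivity of the maximum to get

\[ \mathbb{E}\|Z\|_{\infty\to p} \le \mathbb{E}\max_{\|u\|_q = 1} \sum_k \bigl(|\langle z_k, u\rangle| - \mathbb{E}'|\langle z_k', u\rangle|\bigr) + \max_{\|u\|_q = 1} \sum_k \mathbb{E}|\langle z_k, u\rangle| =: F + S. \tag{1} \]

Begin with the first term in (1). Use Jensen's inequality to draw the expectation outside of the
maximum:

\[ F \le \mathbb{E}\max_{\|u\|_q = 1} \sum_k \bigl(|\langle z_k, u\rangle| - |\langle z_k', u\rangle|\bigr). \]

Now apply Proposition 2 to symmetrize the random variable:

\[ F \le \mathbb{E}\max_{\|u\|_q = 1} \sum_k \varepsilon_k \bigl(|\langle z_k, u\rangle| - |\langle z_k', u\rangle|\bigr). \]

By the subadditivity of the maximum,

\[ F \le \mathbb{E}\Bigl(\max_{\|u\|_q = 1} \sum_k \varepsilon_k |\langle z_k, u\rangle| + \max_{\|u\|_q = 1} \sum_k -\varepsilon_k |\langle z_k', u\rangle|\Bigr) = 2\,\mathbb{E}\max_{\|u\|_q = 1} \sum_k \varepsilon_k |\langle z_k, u\rangle|, \]

where we have invoked the fact that $-\varepsilon_k$ has the Rademacher distribution. Apply Proposition 3
to get the final estimate of F:

\[ F \le 2\,\mathbb{E}\max_{\|u\|_q = 1} \sum_k \varepsilon_k \langle z_k, u\rangle = 2\,\mathbb{E}\max_{\|u\|_q = 1} \Bigl\langle \sum_k \varepsilon_k z_k, u\Bigr\rangle = 2\,\mathbb{E}\Bigl\|\sum_k \varepsilon_k z_k\Bigr\|_p. \]

Now consider the last term in (1). Use Jensen's inequality to prepare for symmetrization:

\[ S = \max_{\|u\|_q = 1} \sum_k \mathbb{E}\Bigl|\sum_j \bigl(Z_{jk} - \mathbb{E}' Z_{jk}'\bigr) u_j\Bigr| \le \max_{\|u\|_q = 1} \sum_k \mathbb{E}\Bigl|\sum_j \bigl(Z_{jk} - Z_{jk}'\bigr) u_j\Bigr|. \]

Apply Proposition 2 to the expectation of the inner sum to see

\[ S \le \max_{\|u\|_q = 1} \sum_k \mathbb{E}\Bigl|\sum_j \varepsilon_j \bigl(Z_{jk} - Z_{jk}'\bigr) u_j\Bigr|. \]

The triangle inequality gives us the final expression:

\[ S \le \max_{\|u\|_q = 1} 2\,\mathbb{E}\sum_k \Bigl|\sum_j \varepsilon_j Z_{jk} u_j\Bigr|. \]

Introduce the bounds for F and S into (1) to complete the proof. □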

4.2. Tail bound for $\infty\to p$ norm. We now develop a deviation bound for the $\infty\to p$ approximation
error. The argument is based on a bounded differences inequality.

First we establish some notation. Let $g : \mathbb{R}^n \to \mathbb{R}$ be a measurable function of n random
variables. Let $X_1, \dots, X_n$ be independent random variables, and write $W = g(X_1, \dots, X_n)$. Let
$W_i$ denote the random variable obtained by replacing the ith argument of g with an independent
copy: $W_i = g(X_1, \dots, X_i', \dots, X_n)$.

The following bounded differences inequality states that if g is insensitive to changes of a single 
argument, then W does not deviate much from its mean. 

Proposition 4 ([BLM03]). Let W and $\{W_i\}$ be random variables defined as above. Assume that
there exists a positive number C such that, almost surely,

\[ \sum_{i=1}^n (W - W_i)^2\, \mathbb{1}_{W > W_i} \le C. \]

Then, for all t > 0,

\[ \mathbb{P}(W > \mathbb{E}W + t) \le e^{-t^2/(4C)}. \]

To apply Proposition 4, we let $Z = A - X$ be our error matrix, $W = \|Z\|_{\infty\to p}$, and
$W^{jk} = \|Z^{jk}\|_{\infty\to p}$, where $Z^{jk}$ is a matrix obtained by replacing $a_{jk} - X_{jk}$ with an identically distributed
variable $a_{jk} - X_{jk}'$ while keeping all other variables fixed. The $\infty\to p$ norms are sufficiently
insensitive to each entry of the matrix that Proposition 4 gives us a useful deviation bound.

Theorem 2. Fix an m x n matrix A, and let X be a random matrix with independent entries for
which $\mathbb{E}X = A$. Assume $|X_{jk}| \le D/2$ almost surely for all j, k. Then, for all t > 0,

\[ \mathbb{P}\bigl(\|A - X\|_{\infty\to p} > \mathbb{E}\|A - X\|_{\infty\to p} + t\bigr) \le e^{-t^2/(4D^2 n m^s)}, \]

where $s = \max\{0, 1 - 2/q\}$ and q is the conjugate exponent to p.

Proof. Let q be the conjugate exponent of p, and choose u, v such that $W = u^T Z v$ and $\|u\|_q = 1$
and $\|v\|_\infty = 1$. Then

\[ (W - W^{jk})\, \mathbb{1}_{W > W^{jk}} \le u^T (Z - Z^{jk}) v\, \mathbb{1}_{W > W^{jk}} = (X_{jk}' - X_{jk})\, u_j v_k\, \mathbb{1}_{W > W^{jk}} \le D\,|u_j v_k|. \]

This implies

\[ \sum_{j,k} (W - W^{jk})^2\, \mathbb{1}_{W > W^{jk}} \le D^2 \sum_{j,k} |u_j v_k|^2 \le n D^2 \|u\|_2^2, \]

so we can apply Proposition 4 if we have an estimate for $\|u\|_2^2$. We have the bounds $\|u\|_2 \le \|u\|_q$
for $q \in [1, 2]$ and $\|u\|_2 \le m^{1/2 - 1/q}\,\|u\|_q$ for $q \in [2, \infty]$. Therefore,

\[ \sum_{j,k} (W - W^{jk})^2\, \mathbb{1}_{W > W^{jk}} \le D^2 \begin{cases} n\, m^{1 - 2/q}, & q \in [2, \infty], \\ n, & q \in [1, 2]. \end{cases} \]

It follows from Proposition 4 that

\[ \mathbb{P}\bigl(\|A - X\|_{\infty\to p} > \mathbb{E}\|A - X\|_{\infty\to p} + t\bigr) = \mathbb{P}(W > \mathbb{E}W + t) \le e^{-t^2/(4D^2 n m^s)}, \]

where $s = \max\{0, 1 - 2/q\}$. □

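It is often convenient to measure deviations on the scale of the mean. Taking $t = \delta\,\mathbb{E}\|A - X\|_{\infty\to p}$
in Theorem 2 gives the following result.

Corollary 1. Under the conditions of Theorem 2, for all $\delta > 0$,

\[ \mathbb{P}\bigl(\|A - X\|_{\infty\to p} > (1 + \delta)\,\mathbb{E}\|A - X\|_{\infty\to p}\bigr) \le e^{-\delta^2 (\mathbb{E}\|A - X\|_{\infty\to p})^2/(4D^2 n m^s)}. \]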
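
To get a feel for how the exponent depends on p, the right-hand side of Theorem 2 can be evaluated directly. This is a minimal sketch; the function names `conjugate_inverse` and `tail_bound` are hypothetical, and the numerical inputs are chosen purely for illustration.

```python
import math

def conjugate_inverse(p):
    """Return 1/q, where q is the conjugate exponent of p (1/p + 1/q = 1)."""
    return 1.0 - 1.0 / p

def tail_bound(t, D, m, n, p):
    """Evaluate exp(-t^2 / (4 D^2 n m^s)) with s = max(0, 1 - 2/q)."""
    s = max(0.0, 1.0 - 2.0 * conjugate_inverse(p))
    return math.exp(-t ** 2 / (4.0 * D ** 2 * n * m ** s))

# p = 1 gives q = infinity and s = 1, so the bound weakens with the row count m.
bound_p1 = tail_bound(t=2.0, D=1.0, m=4, n=1, p=1)   # exp(-4 / 16)
# p = 2 gives q = 2 and s = 0, so m does not enter the exponent.
bound_p2 = tail_bound(t=2.0, D=1.0, m=4, n=1, p=2)   # exp(-4 / 4)
```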



5. Approximation in the $\infty\to1$ norm

In this section, we develop the $\infty\to1$ error bound as a consequence of Theorem 1. We then
prove that one form of the error bound is optimal, and we describe an example of its application
to matrix sparsification.

5.1. Expected $\infty\to1$ norm. To derive the $\infty\to1$ error bound, we first apply Theorem 1 with
p = 1.



Theorem 3. Suppose that Z is a random matrix with independent, zero-mean entries. Then

\[ \mathbb{E}\|Z\|_{\infty\to1} \le 2\,\mathbb{E}\bigl(\|Z\|_{\mathrm{col}} + \|Z^T\|_{\mathrm{col}}\bigr). \]



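Proof. Apply Theorem 1 to get

\[ \mathbb{E}\|Z\|_{\infty\to1} \le 2\,\mathbb{E}\Bigl\|\sum_k \varepsilon_k z_k\Bigr\|_1 + 2\max_{\|u\|_\infty = 1} \mathbb{E}\sum_k \Bigl|\sum_j \varepsilon_j Z_{jk} u_j\Bigr| =: F + S. \tag{2} \]

Use Hölder's inequality to bound the first term in (2) with a sum of squares:

\[ F = 2\,\mathbb{E}\sum_j \Bigl|\sum_k \varepsilon_k Z_{jk}\Bigr| = 2\,\mathbb{E}_Z \sum_j \mathbb{E}_\varepsilon \Bigl|\sum_k \varepsilon_k Z_{jk}\Bigr| \le 2\,\mathbb{E}_Z \sum_j \Bigl(\mathbb{E}_\varepsilon \Bigl|\sum_k \varepsilon_k Z_{jk}\Bigr|^2\Bigr)^{1/2}. \]

The inner expectation can be computed exactly by expanding the square and using the independence
of the Rademacher variables:

\[ F \le 2\,\mathbb{E}\sum_j \Bigl(\sum_k Z_{jk}^2\Bigr)^{1/2} = 2\,\mathbb{E}\|Z^T\|_{\mathrm{col}}. \]

We treat the second term in the same manner. Use Hölder's inequality to replace the sum with a
sum of squares and invoke the independence of the Rademacher variables to eliminate cross terms:

\[ S \le 2\max_{\|u\|_\infty = 1} \mathbb{E}_Z \sum_k \Bigl(\mathbb{E}_\varepsilon \Bigl|\sum_j \varepsilon_j Z_{jk} u_j\Bigr|^2\Bigr)^{1/2} = 2\max_{\|u\|_\infty = 1} \mathbb{E}\sum_k \Bigl(\sum_j Z_{jk}^2 u_j^2\Bigr)^{1/2}. \]

Since $\|u\|_\infty = 1$, it follows that $u_j^2 \le 1$ for all j, and

\[ S \le 2\,\mathbb{E}\sum_k \Bigl(\sum_j Z_{jk}^2\Bigr)^{1/2} = 2\,\mathbb{E}\|Z\|_{\mathrm{col}}. \]

Introduce these estimates for F and S into (2) to complete the proof. □

Taking $Z = A - X$ in Theorem 3, we find

\[ \mathbb{E}\|A - X\|_{\infty\to1} \le 2\,\mathbb{E}\Bigl[\sum_k \Bigl(\sum_j (a_{jk} - X_{jk})^2\Bigr)^{1/2} + \sum_j \Bigl(\sum_k (a_{jk} - X_{jk})^2\Bigr)^{1/2}\Bigr]. \]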

A simple application of Jensen's inequality gives an error bound in terms of the variances of the 
entries of X. 

Corollary 2. Fix the matrix A, and let X be a random matrix with independent entries for which
$\mathbb{E}X_{jk} = a_{jk}$. Then

\[ \mathbb{E}\|A - X\|_{\infty\to1} \le 2\Bigl[\sum_k \Bigl(\sum_j \operatorname{Var}(X_{jk})\Bigr)^{1/2} + \sum_j \Bigl(\sum_k \operatorname{Var}(X_{jk})\Bigr)^{1/2}\Bigr]. \]
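
The right-hand side of Corollary 2 depends only on the matrix of entrywise variances, so it is simple to evaluate for a candidate scheme. A minimal sketch follows; the function name `corollary2_bound` and the test input are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

def corollary2_bound(var):
    """2 * [ sum_k (sum_j var_jk)^(1/2) + sum_j (sum_k var_jk)^(1/2) ].

    `var` holds Var(X_jk) with rows indexed by j and columns by k.
    """
    var = np.asarray(var, dtype=float)
    col_term = np.sqrt(var.sum(axis=0)).sum()   # one square root per column k
    row_term = np.sqrt(var.sum(axis=1)).sum()   # one square root per row j
    return 2.0 * (col_term + row_term)

# Unit variances on a 4 x 9 matrix: 9 columns contribute sqrt(4) each and
# 4 rows contribute sqrt(9) each, so the bound is 2 * (18 + 12) = 60.
bound = corollary2_bound(np.ones((4, 9)))
```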



5.2. Optimality. The bound in Corollary 2 is optimal in the sense that there are families of
matrices A and random approximants X for which $\mathbb{E}\|A - X\|_{\infty\to1}$ grows like one of the terms
in the bound and dominates the other term in the bound. To show this, we construct specific
examples.

Let A be a tall $m \times \sqrt{m}$ matrix of ones and choose the approximant $X_{jk} \sim 2\operatorname{Bern}(1/2)$. With
this choice, $\operatorname{Var}(X_{jk}) = 1$, so the first term in the bound is m and the second term is $m^{5/4}$. The
following argument from [RV07, Sec. 4.2] establishes that $\|A - X\|_{\infty\to1}$ grows like $m^{5/4}$.

Observe that the matrix $A - X = [\varepsilon_{jk}]$, where $\varepsilon_{jk}$ are i.i.d. Rademacher variables. Its $\infty\to1$
norm is

\[ \|A - X\|_{\infty\to1} = \max_{\|x\|_\infty = 1,\ \|y\|_\infty = 1} \sum_{j,k} \varepsilon_{jk}\, y_j x_k. \]

Let $\delta_k$ be a sequence of i.i.d. Rademacher variables. By the scalar Khintchine inequality,

\[ \mathbb{E}_\delta \sum_j \Bigl|\sum_k \varepsilon_{jk} \delta_k\Bigr| \ge \frac{1}{\sqrt{2}} \sum_j \Bigl(\sum_k \varepsilon_{jk}^2\Bigr)^{1/2} = \frac{1}{\sqrt{2}}\, m^{5/4}. \]

The probabilistic method shows that there is a sign vector x for which

\[ \sum_j \Bigl|\sum_k \varepsilon_{jk} x_k\Bigr| \ge \frac{1}{\sqrt{2}}\, m^{5/4}. \]

Choose the vector y with components $y_j = \operatorname{sgn}\bigl(\sum_k \varepsilon_{jk} x_k\bigr)$. Then

\[ \|A - X\|_{\infty\to1} \ge \sum_{j,k} \varepsilon_{jk}\, y_j x_k = \sum_j \Bigl|\sum_k \varepsilon_{jk} x_k\Bigr| \ge \frac{1}{\sqrt{2}}\, m^{5/4}. \]

This shows $\|A - X\|_{\infty\to1}$ grows like the second term in the error bound, so this term of the bound
cannot be ignored.

These arguments, applied to a fat $\sqrt{n} \times n$ matrix of ones, also establish the necessity of the first
term.
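
The tall example can be reproduced numerically for a small m. The sketch below, which assumes m = 16 so that the matrix has four columns and the $\infty\to1$ norm can be found by enumeration, evaluates the two terms of the bound and compares one random draw of $A - X$ with the lower bound $m^{5/4}/\sqrt{2}$. The variable names are illustrative.

```python
import itertools
import numpy as np

m = 16                       # number of rows
k = int(np.sqrt(m))          # number of columns, sqrt(m) = 4
var = np.ones((m, k))        # Var(X_jk) = 1 when X_jk ~ 2 Bern(1/2)

first_term = np.sqrt(var.sum(axis=0)).sum()    # k * sqrt(m) = m
second_term = np.sqrt(var.sum(axis=1)).sum()   # m * sqrt(k) = m ** 1.25

rng = np.random.default_rng(0)
E = rng.choice([-1.0, 1.0], size=(m, k))       # one realization of A - X

# ||A - X||_{inf->1}: maximize ||E x||_1 over the 2^k sign vectors x.
norm = max(np.abs(E @ np.array(x)).sum()
           for x in itertools.product((-1.0, 1.0), repeat=k))
lower = m ** 1.25 / np.sqrt(2)                 # bound from the Khintchine argument
```

Since the Khintchine argument applies to every realization of the signs, `norm >= lower` holds for any seed, not only on average.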

5.3. Example application. In this section we provide an example illustrating the application of
Corollary 2 to matrix sparsification.

From Corollary 2 we infer that a good scheme for sparsifying a matrix A while minimizing the
expected relative $\infty\to1$ error is one which drastically increases the sparsity of X while keeping the
relative error

\[ \frac{\sum_k \bigl(\sum_j \operatorname{Var}(X_{jk})\bigr)^{1/2} + \sum_j \bigl(\sum_k \operatorname{Var}(X_{jk})\bigr)^{1/2}}{\|A\|_{\infty\to1}} \]

small. Once a sparsification scheme is chosen, the hardest part of estimating this quantity is
probably estimating the $\infty\to1$ norm of A. The example shows, for a simple family of approximation
schemes, what kind of sparsification results can be obtained using Corollary 2 when we have a very
good handle on this quantity.

Consider the case where A is an n x n matrix whose entries all lie within an interval bounded
away from zero; for definiteness, take them to be positive. Let $\gamma$ be a desired bound on the expected
relative $\infty\to1$ norm error. We choose the randomization strategy $X_{jk} \sim \frac{a_{jk}}{p}\operatorname{Bern}(p)$ and ask how
small can p be without violating our bound on the expected error.

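In this case,

\[ \|A\|_{\infty\to1} = \sum_{j,k} a_{jk} = O(n^2), \]

and $\operatorname{Var}(X_{jk}) = \frac{a_{jk}^2}{p} - a_{jk}^2$. Consequently, the first term in Corollary 2 satisfies

\[ \sum_k \Bigl(\sum_j \operatorname{Var}(X_{jk})\Bigr)^{1/2} = \Bigl(\frac{1-p}{p}\Bigr)^{1/2} \|A\|_{\mathrm{col}} = O\Bigl(n \Bigl(\frac{(1-p)\,n}{p}\Bigr)^{1/2}\Bigr), \]

and likewise the second term satisfies

\[ \sum_j \Bigl(\sum_k \operatorname{Var}(X_{jk})\Bigr)^{1/2} = \Bigl(\frac{1-p}{p}\Bigr)^{1/2} \|A^T\|_{\mathrm{col}} = O\Bigl(n \Bigl(\frac{(1-p)\,n}{p}\Bigr)^{1/2}\Bigr). \]

Therefore the relative $\infty\to1$ norm error satisfies

\[ \frac{\sum_k \bigl(\sum_j \operatorname{Var}(X_{jk})\bigr)^{1/2} + \sum_j \bigl(\sum_k \operatorname{Var}(X_{jk})\bigr)^{1/2}}{\|A\|_{\infty\to1}} = O\Bigl(\Bigl(\frac{1-p}{pn}\Bigr)^{1/2}\Bigr). \]

It follows that the expected relative $\infty\to1$ error is smaller than $\gamma$ for p on the order of
$(1 + n\gamma^2)^{-1}$ or larger. The expected
number of nonzero entries in X is $pn^2$, so for matrices with this structure, we can sparsify with a
relative $\infty\to1$ norm error smaller than $\gamma$ while reducing the number of expected nonzero entries
to as few as $O\bigl(\frac{n^2}{1 + n\gamma^2}\bigr) = O\bigl(\frac{n}{\gamma^2}\bigr)$. Intuitively, this sparsification result is optimal in the dimension: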
it seems we must keep on average at least one entry per row and column if we are to faithfully 
approximate A. 
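
The scheme used in this example is easy to simulate. The sketch below keeps each entry independently with probability p and rescales by 1/p so that the expectation is preserved; the helper names `sparsify` and `keep_probability`, the matrix size, and the entry range are illustrative assumptions, and the constants hidden in the order-of-magnitude statement above are ignored.

```python
import numpy as np

def sparsify(A, p, rng):
    """Keep each entry with probability p, rescaled by 1/p so that E X = A."""
    A = np.asarray(A, dtype=float)
    mask = rng.random(A.shape) < p
    return np.where(mask, A / p, 0.0)

def keep_probability(n, gamma):
    """p = 1 / (1 + n * gamma^2), the order of magnitude derived above."""
    return 1.0 / (1.0 + n * gamma ** 2)

n, gamma = 400, 0.5
p = keep_probability(n, gamma)            # 1 / 101
expected_nnz = p * n * n                  # n^2 / (1 + n gamma^2), at most n / gamma^2

rng = np.random.default_rng(0)
A = rng.uniform(1.0, 2.0, size=(n, n))    # positive entries bounded away from zero
X = sparsify(A, p, rng)
```

With these values the expected number of nonzero entries is about 1584 out of 160000, slightly below $n/\gamma^2 = 1600$.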

6. Approximation in the $\infty\to2$ norm

In this section, we develop the $\infty\to2$ error bound stated in the introduction, establish the
optimality of a related bound, and provide examples of its application to matrix sparsification. To
derive the error bound, we first specialize Theorem 1 to the case of p = 2.

Theorem 4. Suppose that Z is a random matrix with independent, zero-mean entries. Then

\[ \mathbb{E}\|Z\|_{\infty\to2} \le 2\,\mathbb{E}\|Z\|_F + 2\min_D \mathbb{E}\|ZD^{-1}\|_{2\to\infty}, \]

where D is a positive diagonal matrix that satisfies $\operatorname{trace}(D^2) = 1$.
Proof. Apply Theorem 1 to get

\[ \mathbb{E}\|Z\|_{\infty\to2} \le 2\,\mathbb{E}\Bigl\|\sum_k \varepsilon_k z_k\Bigr\|_2 + 2\max_{\|u\|_2 = 1} \mathbb{E}\sum_k \Bigl|\sum_j \varepsilon_j Z_{jk} u_j\Bigr| =: F + S. \tag{3} \]

Expand the first term, and use Jensen's inequality to move the expectation with respect to the
Rademacher variables inside the square root:

\[ F = 2\,\mathbb{E}\Bigl(\sum_j \Bigl|\sum_k \varepsilon_k Z_{jk}\Bigr|^2\Bigr)^{1/2} \le 2\,\mathbb{E}_Z \Bigl(\sum_j \mathbb{E}_\varepsilon \Bigl|\sum_k \varepsilon_k Z_{jk}\Bigr|^2\Bigr)^{1/2}. \]

The independence of the Rademacher variables implies that the cross terms cancel, so

\[ F \le 2\,\mathbb{E}\Bigl(\sum_j \sum_k Z_{jk}^2\Bigr)^{1/2} = 2\,\mathbb{E}\|Z\|_F. \]

We use the Cauchy-Schwarz inequality to replace the $\ell_1$ norm with an $\ell_2$ norm in the second
term of (3). A direct application would introduce a possibly suboptimal factor of $\sqrt{n}$ (where n
is the number of columns in Z), so instead we choose $d_k > 0$ such that $\sum_k d_k^2 = 1$ and use the
corresponding weighted $\ell_2$ norm:

\[ S = 2\max_{\|u\|_2 = 1} \mathbb{E}\sum_k \frac{\bigl|\sum_j \varepsilon_j Z_{jk} u_j\bigr|}{d_k}\, d_k \le 2\max_{\|u\|_2 = 1} \mathbb{E}\Bigl(\sum_k \frac{\bigl|\sum_j \varepsilon_j Z_{jk} u_j\bigr|^2}{d_k^2}\Bigr)^{1/2}. \]

Move the expectation with respect to the Rademacher variables inside the square root and observe
that the cross terms cancel:

\[ S \le 2\max_{\|u\|_2 = 1} \mathbb{E}_Z \Bigl(\sum_k \frac{\mathbb{E}_\varepsilon \bigl|\sum_j \varepsilon_j Z_{jk} u_j\bigr|^2}{d_k^2}\Bigr)^{1/2} = 2\max_{\|u\|_2 = 1} \mathbb{E}\Bigl(\sum_{j,k} \frac{Z_{jk}^2 u_j^2}{d_k^2}\Bigr)^{1/2}. \]

Use Jensen's inequality to pass the maximum through the expectation, and note that if $\|u\|_2 = 1$
then the vector formed by elementwise squaring u lies on the $\ell_1$ unit ball, thus

\[ S \le 2\,\mathbb{E}\max_{\|u\|_1 = 1} \Bigl(\sum_{j,k} \Bigl(\frac{Z_{jk}}{d_k}\Bigr)^2 u_j\Bigr)^{1/2}. \]

Clearly this maximum is achieved when u is chosen so $u_j = 1$ at an index j for which
$\bigl(\sum_k (Z_{jk}/d_k)^2\bigr)^{1/2}$ is maximal and $u_j = 0$ otherwise. Consequently, the maximum is the largest of the $\ell_2$ norms
of the rows of $ZD^{-1}$, where $D = \operatorname{diag}(d_1, \dots, d_n)$. Recall that this quantity is, by definition,
$\|ZD^{-1}\|_{2\to\infty}$. Therefore $S \le 2\,\mathbb{E}\|ZD^{-1}\|_{2\to\infty}$. The theorem follows by optimizing our choice of
D and introducing our estimates for F and S into (3). □



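Taking $Z = A - X$ in Theorem 4, we have

\[ \mathbb{E}\|A - X\|_{\infty\to2} \le 2\,\mathbb{E}\Bigl(\sum_{j,k} (X_{jk} - a_{jk})^2\Bigr)^{1/2} + 2\min_D \mathbb{E}\max_j \Bigl(\sum_k \frac{(X_{jk} - a_{jk})^2}{d_k^2}\Bigr)^{1/2}. \tag{4} \]

We now derive a bound which depends only on the variances of the $X_{jk}$.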



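Corollary 3. Fix the m x n matrix A and let X be a random matrix with independent entries so
that $\mathbb{E}X = A$. Then

\[ \mathbb{E}\|A - X\|_{\infty\to2} \le 2\Bigl(\sum_{j,k} \operatorname{Var}(X_{jk})\Bigr)^{1/2} + 2\sqrt{m}\,\min_D \max_j \Bigl(\sum_k \frac{\operatorname{Var}(X_{jk})}{d_k^2}\Bigr)^{1/2}, \]

where D is a positive diagonal matrix with $\operatorname{trace}(D^2) = 1$.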



Proof. Let F and S denote, respectively, the first and second term of (4). An application of Jensen's
inequality shows that $F \le 2\bigl(\sum_{j,k} \operatorname{Var}(X_{jk})\bigr)^{1/2}$. A second application shows that

\[ S \le 2\min_D \Bigl(\mathbb{E}\max_j \sum_k \frac{(X_{jk} - a_{jk})^2}{d_k^2}\Bigr)^{1/2}. \]

Bound the maximum with a sum:

\[ S \le 2\min_D \Bigl(\sum_j \sum_k \frac{\mathbb{E}(X_{jk} - a_{jk})^2}{d_k^2}\Bigr)^{1/2}. \]

The sum is controlled by a multiple of its largest term, so

\[ S \le 2\sqrt{m}\,\min_D \Bigl(\max_j \sum_k \frac{\operatorname{Var}(X_{jk})}{d_k^2}\Bigr)^{1/2}, \]

where m is the number of rows of A. □
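
The minimum over D in Corollary 3 is an optimization problem, but any feasible D already yields a valid upper bound. The sketch below evaluates the right-hand side for one such choice; the function name `corollary3_bound` is hypothetical, and the default weights $d_k = n^{-1/2}$ are an assumption made for convenience, not necessarily the minimizer.

```python
import numpy as np

def corollary3_bound(var, d=None):
    """Evaluate the Corollary 3 right-hand side for one feasible diagonal D.

    `var` holds Var(X_jk) (rows j, columns k). Any positive d with
    sum(d**2) == 1 gives an upper bound on the minimum over D.
    """
    var = np.asarray(var, dtype=float)
    m, n = var.shape
    if d is None:
        d = np.full(n, 1.0 / np.sqrt(n))          # uniform weights (assumed default)
    d = np.asarray(d, dtype=float)
    if np.any(d <= 0) or not np.isclose(np.sum(d ** 2), 1.0):
        raise ValueError("d must be positive with sum(d**2) == 1")
    frobenius_term = np.sqrt(var.sum())
    weighted_row_term = np.sqrt((var / d ** 2).sum(axis=1)).max()
    return 2.0 * frobenius_term + 2.0 * np.sqrt(m) * weighted_row_term

# Unit variances on a 4 x 9 matrix: the Frobenius term is sqrt(36) = 6 and each
# weighted row has squared norm 9 * 9 = 81, so the bound is 2*6 + 2*2*9 = 48.
bound = corollary3_bound(np.ones((4, 9)))
```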

6.1. Optimality. We now show that Theorem [J] gives an optimal bound, in the sense that each of 
its terms is necessary. In the following, we reserve the letter D for a positive diagonal matrix with 
trace(Z) 2 ) = 1. 

First, we establish the necessity of the Frobenius term by identifying a class of random matrices 
whose oo — >2 norms are larger than their weighted 2^oo norms but comparable to their Frobenius 
norms. Let Z be a random m x ^fm matrix such that the entries in the first column of Z are 
equally likely to be positive or negative ones, and all other entries are zero. With this choice, 
E ll^lloo^2 = E ||Z|| F = v 7 ^- Meanwhile, E HZD" 1 ^^ = ^, so min c E \\ZD~ 1 = 1, 

which is much smaller than E ||^|| 00 _ > 2- Clearly, the Frobenius term is necessary. 

Similarly, to establish the necessity of the weighted $2\to\infty$ norm term, we consider a class of matrices whose $\infty\to2$ norms are larger than their Frobenius norms but comparable to their weighted $2\to\infty$ norms. Consider a $\sqrt{n} \times n$ matrix $Z$ whose entries are all equally likely to be positive or negative ones. It is a simple task to confirm that $\mathbb{E}\|Z\|_{\infty\to2} \ge n$ and $\mathbb{E}\|Z\|_F = n^{3/4}$; it follows that the weighted $2\to\infty$ norm term is necessary. In fact,

$$\min_D \mathbb{E}\|ZD^{-1}\|_{2\to\infty} = \min_D \mathbb{E}\max_j \Big(\sum_{k=1}^n \frac{Z_{jk}^2}{d_{kk}^2}\Big)^{1/2} = \min_D \Big(\sum_{k=1}^n \frac{1}{d_{kk}^2}\Big)^{1/2} = n,$$

so we see that $\mathbb{E}\|Z\|_{\infty\to2}$ and the weighted $2\to\infty$ norm term are comparable.
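Both optimality examples are small enough to verify directly. The sketch below is illustrative only: the sizes $m = 16$ and $n = 9$ are arbitrary choices, the $\infty\to2$ norm is computed by brute force over sign vectors, and the weighted term is evaluated at the minimizing choice $d_{kk}^2 = 1/n$. The helper name `inf_to_2_norm` is hypothetical.

```python
import itertools
import numpy as np

def inf_to_2_norm(Z):
    """Exact ||Z||_{inf->2} by enumerating all sign vectors (small n only)."""
    n = Z.shape[1]
    signs = np.array(list(itertools.product((-1.0, 1.0), repeat=n)))
    return np.linalg.norm(Z @ signs.T, axis=0).max()

rng = np.random.default_rng(1)

# Example 1: m x sqrt(m) matrix with random signs in the first column only.
m = 16
Z1 = np.zeros((m, 4))
Z1[:, 0] = rng.choice([-1.0, 1.0], size=m)
norm1 = inf_to_2_norm(Z1)            # equals sqrt(m)
fro1 = np.linalg.norm(Z1, "fro")     # equals sqrt(m)

# Example 2: sqrt(n) x n matrix whose entries are all random signs.
n = 9
Z2 = rng.choice([-1.0, 1.0], size=(3, n))
norm2 = inf_to_2_norm(Z2)            # at least n
fro2 = np.linalg.norm(Z2, "fro")     # equals n^(3/4)
d = np.full(n, 1.0 / np.sqrt(n))     # the minimizing diagonal, d_kk^2 = 1/n
weighted2 = np.sqrt((Z2**2 / d**2).sum(axis=1)).max()   # equals n

print(norm1, fro1, norm2, fro2, weighted2)
```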

6.2. Example application. From Theorem 4 we infer that a good scheme for sparsifying a matrix $A$ while minimizing the expected relative $\infty\to2$ norm error is one which drastically increases the sparsity of $X$ while keeping the relative error

$$\frac{\mathbb{E}\|Z\|_F + \min_D \mathbb{E}\|ZD^{-1}\|_{2\to\infty}}{\|A\|_{\infty\to2}}$$

small, where $Z = A - X$.

As before, consider the case where $A$ is an $n \times n$ matrix all of whose entries are positive and in an interval bounded away from zero. Let $\gamma$ be a desired bound on the expected relative $\infty\to2$ norm error. We choose the randomization strategy $X_{jk} \sim \frac{a_{jk}}{p}\,\mathrm{Bern}(p)$ and ask how much we can sparsify while respecting our bound on the relative error. That is, how small can $p$ be? We appeal to Theorem 4. In this case,

$$\|A\|_{\infty\to2} = \Big(\sum_j \sum_k a_{jk}^2 + 2\sum_j \sum_{\ell<m} a_{j\ell}\,a_{jm}\Big)^{1/2} = O\big((n^2 + n^2(n-1))^{1/2}\big).$$

By Jensen's inequality,

$$\mathbb{E}\|Z\|_F \le \mathbb{E}\|A\|_F + \mathbb{E}\|X\|_F \le \Big(1 + \frac{1}{\sqrt{p}}\Big)\|A\|_F = O\Big(n\Big(1 + \frac{1}{\sqrt{p}}\Big)\Big).$$

We bound the other term in the numerator, also using Jensen's inequality:

$$\min_D \mathbb{E}\|ZD^{-1}\|_{2\to\infty} \le \sqrt{n}\,\mathbb{E}\|Z\|_{2\to\infty} \le \sqrt{n}\Big(1 + \frac{1}{\sqrt{p}}\Big)\|A\|_{2\to\infty} = O\Big(n\Big(1 + \frac{1}{\sqrt{p}}\Big)\Big).$$

We combine these estimates to get

$$\frac{\mathbb{E}\|Z\|_F + \min_D \mathbb{E}\|ZD^{-1}\|_{2\to\infty}}{\|A\|_{\infty\to2}} = O\Big(\frac{1}{\sqrt{n}} + \frac{1}{\sqrt{pn}}\Big) = O\Big(\frac{1}{\sqrt{pn}}\Big).$$

We conclude that, for this class of matrices and this family of sparsification schemes, we can reduce the number of expected nonzero terms to $O(n/\gamma^2)$ while maintaining an expected $\infty\to2$ norm relative error of $\gamma$.
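The choice of $p$ in this example can be illustrated with a short simulation. This is a sketch under stated assumptions rather than a verification of the asymptotic claim: it sets $p = 1/(\gamma^2 n)$ with all hidden constants taken to be one, uses $D = I/\sqrt{n}$ in place of the minimizing $D$, and relies on the fact that for an entrywise positive $A$ the $\infty\to2$ norm is attained at the all-ones vector. The sizes and the entry range are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 200, 0.25                      # illustrative values (assumed)
A = rng.uniform(1.0, 2.0, size=(n, n))    # positive entries, bounded away from zero
p = 1.0 / (gamma**2 * n)                  # solves 1/sqrt(p n) = gamma, constants dropped

# One draw of the sparsified matrix X_jk ~ (a_jk / p) Bern(p).
X = np.where(rng.random(A.shape) < p, A / p, 0.0)
Z = A - X

norm_A = np.linalg.norm(A @ np.ones(n))   # ||A||_{inf->2} when A is entrywise positive
frob = np.linalg.norm(Z, "fro")
weighted = np.sqrt(n) * np.linalg.norm(Z, axis=1).max()   # ||Z D^{-1}||_{2->inf}, D = I/sqrt(n)
proxy = (frob + weighted) / norm_A        # single-sample version of the relative error ratio
nnz = int(np.count_nonzero(X))

print(f"p = {p:.3f}, nonzeros = {nnz} (expected {p * n * n:.0f}), ratio = {proxy:.3f}")
```

The expected number of nonzeros is $pn^2 = n/\gamma^2$, which is the quantity the paragraph above bounds.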

7. Spectral error bound 

In this section we establish a bound on $\mathbb{E}\|A - X\|$ as an immediate consequence of Latala's result [Lat05]. We then derive a deviation inequality for the spectral approximation error using a log-Sobolev inequality from [BLM03], and use it to compare our results to those of Achlioptas and McSherry [AM07] and Arora, Hazan, and Kale [AHK06].

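Theorem 5. Suppose $A$ is a fixed matrix, and let $X$ be a random matrix with independent entries for which $\mathbb{E}X = A$. Then

$$\mathbb{E}\|A - X\| \le C\Big[\max_j \Big(\sum_k \operatorname{Var}(X_{jk})\Big)^{1/2} + \max_k \Big(\sum_j \operatorname{Var}(X_{jk})\Big)^{1/2} + \Big(\sum_{jk} \mathbb{E}(X_{jk} - a_{jk})^4\Big)^{1/4}\Big],$$

where $C$ is a universal constant.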

In [Lat05], Latala considered the spectral norm of random matrices with independent, zero-mean entries, and he showed that, for any such matrix $Z$,

$$\mathbb{E}\|Z\| \le C\Big[\max_j \Big(\sum_k \mathbb{E}z_{jk}^2\Big)^{1/2} + \max_k \Big(\sum_j \mathbb{E}z_{jk}^2\Big)^{1/2} + \Big(\sum_{jk} \mathbb{E}z_{jk}^4\Big)^{1/4}\Big],$$

where $C$ is some universal constant. Unfortunately, no estimate for $C$ is available. Theorem 5 follows from Latala's result by taking $Z = A - X$.
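Since the constant $C$ is not known, the three terms in Theorem 5 can only be compared with the observed error up to a constant. The sketch below computes those terms for the Bernoulli sparsification $X_{jk} \sim (a_{jk}/p)\,\mathrm{Bern}(p)$ and reports the ratio of a Monte Carlo estimate of $\mathbb{E}\|A - X\|$ to their sum. The function name `latala_terms`, the matrix sizes, and the choice of scheme are assumptions made for illustration.

```python
import numpy as np

def latala_terms(var, fourth):
    """Return the three terms of Theorem 5 (without the constant C).

    var[j, k]    = Var(X_jk)
    fourth[j, k] = E (X_jk - a_jk)^4
    """
    row_term = np.sqrt(var.sum(axis=1).max())   # max_j (sum_k Var)^(1/2)
    col_term = np.sqrt(var.sum(axis=0).max())   # max_k (sum_j Var)^(1/2)
    fourth_term = fourth.sum() ** 0.25          # (sum_jk fourth moment)^(1/4)
    return row_term, col_term, fourth_term

rng = np.random.default_rng(3)
m, n, p = 100, 150, 0.2
A = rng.uniform(-1.0, 1.0, size=(m, n))

# X - a equals a(1-p)/p with probability p and -a with probability 1-p.
var = A**2 * (1.0 - p) / p
fourth = A**4 * ((1.0 - p) ** 4 / p**3 + (1.0 - p))
total = sum(latala_terms(var, fourth))

errors = [
    np.linalg.norm(A - np.where(rng.random(A.shape) < p, A / p, 0.0), 2)
    for _ in range(20)
]
ratio = float(np.mean(errors)) / total
print(f"sum of terms = {total:.2f}, mean spectral error / sum = {ratio:.2f}")
```

The printed ratio is an empirical lower estimate of how large $C$ must be for this particular instance; it says nothing about the universal constant.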

The bounded differences argument used earlier establishes the correct (subgaussian) tail behavior of $\|A - X\|$.

Theorem 6. Fix the matrix $A$, and let $X$ be a random matrix with independent entries for which $\mathbb{E}X = A$. Assume $|X_{jk}| \le D/2$ almost surely for all $j, k$. Then, for all $t > 0$,

$$\mathbb{P}\{\|A - X\| > \mathbb{E}\|A - X\| + t\} \le e^{-t^2/(4D^2)}.$$

Proof. The proof is exactly that of Theorem 2, except now $u$ and $v$ are both in the $\ell_2$ unit sphere. □

We find it convenient to measure deviations on the scale of the mean. 

Corollary 4. Under the conditions of Theorem 6, for all $\delta > 0$,

$$\mathbb{P}\{\|A - X\| > (1 + \delta)\,\mathbb{E}\|A - X\|\} \le e^{-\delta^2 (\mathbb{E}\|A - X\|)^2/(4D^2)}.$$
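The corollary is a direct substitution into Theorem 6. Written out as a short sketch, with $t = \delta\,\mathbb{E}\|A - X\|$ (positive whenever $X$ is not almost surely equal to $A$):

```latex
\begin{align*}
\mathbb{P}\bigl\{\|A - X\| > (1+\delta)\,\mathbb{E}\|A - X\|\bigr\}
  &= \mathbb{P}\bigl\{\|A - X\| > \mathbb{E}\|A - X\| + t\bigr\},
     \qquad t = \delta\,\mathbb{E}\|A - X\|, \\
  &\le e^{-t^2/(4D^2)}
   = e^{-\delta^2 (\mathbb{E}\|A - X\|)^2/(4D^2)}.
\end{align*}
```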



7.1. Comparison with previous results. To demonstrate the applicability of our bound on the spectral norm error, we consider the sparsification and quantization schemes used by Achlioptas and McSherry [AM07], and the quantization scheme proposed by Arora, Hazan, and Kale [AHK06]. We show that our spectral norm error bound and the associated concentration result give results of the same order, with less effort. Throughout these comparisons, we take $A$ to be an $m \times n$ matrix, with $m < n$, and we define $b = \max_{jk} |a_{jk}|$.

7.1.1. A matrix quantization scheme. First we consider the scheme proposed by Achlioptas and McSherry for quantization of the matrix entries:

$$X_{jk} = \begin{cases} b & \text{with probability } \frac{1}{2} + \frac{a_{jk}}{2b}, \\ -b & \text{with probability } \frac{1}{2} - \frac{a_{jk}}{2b}. \end{cases}$$

With this choice $\operatorname{Var}(X_{jk}) = b^2 - a_{jk}^2 \le b^2$, and $\mathbb{E}(X_{jk} - a_{jk})^4 = b^4 - 3a_{jk}^4 + 2a_{jk}^2 b^2 \le 3b^4$, so the expected spectral error satisfies

$$\mathbb{E}\|A - X\| \le C\big(\sqrt{n}\,b + \sqrt{m}\,b + b\sqrt[4]{3mn}\big) \le 4Cb\sqrt{n}.$$

Applying Corollary 4, we find that the error satisfies

$$\mathbb{P}\{\|A - X\| > 4Cb\sqrt{n}\,(1 + \delta)\} \le e^{-\delta^2 C^2 n}.$$

In particular, with probability at least $1 - \exp(-C^2 n)$,

$$\|A - X\| \le 8Cb\sqrt{n}.$$

Achlioptas and McSherry proved that for $n \ge n_0$, where $n_0$ is on the order of $10^9$, with probability at least $1 - \exp(-19(\log n)^4)$,

$$\|A - X\| \le 4b\sqrt{n}.$$

Thus, Theorem 6 provides a bound of the same order in $n$ which holds with higher probability and over a larger range of $n$.
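The quantization scheme is simple to simulate. The following sketch draws one quantized matrix and compares the observed spectral error with $4b\sqrt{n}$, used here only as a reference scale because the constant $C$ is unknown. The function name `quantize` and the test matrix are hypothetical.

```python
import numpy as np

def quantize(A, rng):
    """Round each entry to +b or -b so that the result is unbiased for A."""
    b = np.abs(A).max()
    prob_plus = 0.5 + A / (2.0 * b)          # P(X_jk = +b)
    return np.where(rng.random(A.shape) < prob_plus, b, -b)

rng = np.random.default_rng(4)
m, n = 50, 80
A = rng.uniform(-1.0, 1.0, size=(m, n))
b = np.abs(A).max()

X = quantize(A, rng)
err = np.linalg.norm(A - X, 2)               # spectral norm of the error
var = b**2 - A**2                            # entrywise variance of X
print(f"spectral error = {err:.2f}, 4 b sqrt(n) = {4 * b * np.sqrt(n):.2f}")
```

Every entry of `X` needs only one bit beyond the shared scale `b`, which is the point of the scheme.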

7.1.2. A nonuniform sparsification scheme. Next we consider an analog to the nonuniform sparsification scheme proposed in the same paper. Fix a number $p$ in the range $(0, 1)$ and sparsify entries with probabilities proportional to their magnitudes:

$$X_{jk} \sim \frac{a_{jk}}{p_{jk}}\,\mathrm{Bern}(p_{jk}), \quad \text{where } p_{jk} = \max\Big\{p\Big(\frac{a_{jk}}{b}\Big)^2,\ \sqrt{p\Big(\frac{a_{jk}}{b}\Big)^2 \times \frac{(8\log n)^4}{n}}\Big\}.$$

Achlioptas and McSherry determine that, with probability at least $1 - \exp(-19(\log n)^4)$,

$$\|A - X\| < 4b\sqrt{n/p}.$$

Further, the expected number of nonzero entries in $X$ is less than

$$pmn \times \operatorname{Avg}[(a_{jk}/b)^2] + m(8\log n)^4. \tag{5}$$

Their choice of $p_{jk}$, in particular the insertion of the $(8\log n)^4/n$ factor, is an artifact of their method of proof. Instead, we consider a scheme which compares the magnitudes of $a_{jk}$ and $b$ to determine $p_{jk}$. Introduce the quantity $R = \max_{a_{jk} \ne 0} b/|a_{jk}|$ to measure the spread of the entries in $A$, and take

$$X_{jk} \sim \begin{cases} \dfrac{a_{jk}}{p_{jk}}\,\mathrm{Bern}(p_{jk}), \text{ where } p_{jk} = \dfrac{p\,a_{jk}^2}{p\,a_{jk}^2 + b^2}, & a_{jk} \ne 0, \\[2ex] 0, & a_{jk} = 0. \end{cases}$$

With this scheme, $\operatorname{Var}(X_{jk}) = 0$ when $a_{jk} = 0$, otherwise $\operatorname{Var}(X_{jk}) = b^2/p$. Likewise, $\mathbb{E}(X_{jk} - a_{jk})^4 = 0$ if $a_{jk} = 0$, otherwise

$$\mathbb{E}(X_{jk} - a_{jk})^4 \le \operatorname{Var}(X_{jk})\,\|X_{jk} - a_{jk}\|_\infty^2 = \frac{b^2}{p}\max\Big\{|a_{jk}|,\ |a_{jk}|\Big(\frac{p\,a_{jk}^2 + b^2}{p\,a_{jk}^2} - 1\Big)\Big\}^2 \le \frac{b^4}{p^2}R^2,$$

so

$$\mathbb{E}\|A - X\| \le C\Big(b\sqrt{\frac{n}{p}} + b\sqrt{\frac{m}{p}} + b\sqrt{\frac{R}{p}}\,\sqrt[4]{mn}\Big) \le C(2 + \sqrt{R})\,b\sqrt{\frac{n}{p}}.$$

Applying Corollary 4, we find that with probability at least $1 - \exp(-C^2(2 + \sqrt{R})^2\,pn/16)$,

$$\|A - X\| \le 2C(2 + \sqrt{R})\,b\sqrt{\frac{n}{p}}.$$

Thus, Theorem 5 and Achlioptas and McSherry's scheme-specific analysis yield results of the same order in $n$ and $p$. As before, we see that our bound holds with higher probability and over a larger range of $n$. Furthermore, since the expected number of nonzero entries in $X$ satisfies

$$\sum_{jk} p_{jk} = \sum_{jk} \frac{p\,a_{jk}^2}{p\,a_{jk}^2 + b^2} \le pmn \times \operatorname{Avg}[(a_{jk}/b)^2],$$

we have established a smaller limit on the expected number of nonzero entries.
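The modified scheme and two of its stated properties, the variance $b^2/p$ for nonzero entries and the bound on the expected number of nonzeros, can be checked directly. This is a minimal sketch assuming $A$ is not identically zero; the function name `sparsify_nonuniform` and the test matrix are hypothetical.

```python
import numpy as np

def sparsify_nonuniform(A, p, rng):
    """Keep entry a_jk with probability p a^2 / (p a^2 + b^2), rescaled to stay unbiased.

    Assumes A has at least one nonzero entry, so that b > 0.
    """
    b = np.abs(A).max()
    keep_prob = p * A**2 / (p * A**2 + b**2)      # equals 0 exactly where a_jk = 0
    mask = rng.random(A.shape) < keep_prob
    X = np.zeros_like(A)
    X[mask] = A[mask] / keep_prob[mask]
    return X, keep_prob

rng = np.random.default_rng(5)
m, n, p = 60, 90, 0.3
A = rng.uniform(-1.0, 1.0, size=(m, n))
A[np.abs(A) < 0.05] = 0.0                         # include some exact zeros
b = np.abs(A).max()

X, keep_prob = sparsify_nonuniform(A, p, rng)

nonzero = A != 0
q = keep_prob[nonzero]
var_nonzero = A[nonzero] ** 2 * (1.0 - q) / q     # variance of (a/q) Bern(q)
expected_nnz = keep_prob.sum()
nnz_bound = p * ((A / b) ** 2).sum()              # p m n Avg[(a/b)^2]
print(f"expected nonzeros = {expected_nnz:.1f}, bound = {nnz_bound:.1f}")
```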



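7.1.3. A scheme which simultaneously sparsifies and quantizes. Finally, we use Theorem 6 to estimate the error in using the scheme from [AHK06] which simultaneously quantizes and sparsifies. Fix $\delta > 0$ and consider

$$X_{jk} = \begin{cases} \operatorname{sgn}(a_{jk})\,\dfrac{\delta}{\sqrt{n}}\,\mathrm{Bern}\Big(\dfrac{|a_{jk}|\sqrt{n}}{\delta}\Big), & |a_{jk}| \le \dfrac{\delta}{\sqrt{n}}, \\[2ex] a_{jk}, & \text{otherwise.} \end{cases}$$

Then $\operatorname{Var}(X_{jk}) = 0$ if $|a_{jk}| > \delta/\sqrt{n}$, otherwise

$$\operatorname{Var}(X_{jk}) = |a_{jk}|^3\frac{\sqrt{n}}{\delta} - 2a_{jk}^2 + |a_{jk}|\frac{\delta}{\sqrt{n}} \le \frac{\delta^2}{n}.$$

Also the fourth moment term is zero for large enough $a_{jk}$, otherwise

$$\mathbb{E}(X_{jk} - a_{jk})^4 = |a_{jk}|^5\frac{\sqrt{n}}{\delta} - 4a_{jk}^4 + 6|a_{jk}|^3\frac{\delta}{\sqrt{n}} - 4a_{jk}^2\Big(\frac{\delta}{\sqrt{n}}\Big)^2 + |a_{jk}|\Big(\frac{\delta}{\sqrt{n}}\Big)^3 \le 8\frac{\delta^4}{n^2}.$$

This gives the estimates

$$\mathbb{E}\|A - X\| \le C\Big(\delta + \delta\sqrt{\frac{m}{n}} + 2\delta\Big(\frac{m}{n}\Big)^{1/4}\Big) \le 4C\delta$$

and

$$\mathbb{P}\{\|A - X\| > 4C\delta(\gamma + 1)\} \le e^{-\gamma^2 C^2 n}.$$

Taking $\gamma = 1$, we see that with probability at least $1 - \exp(-C^2 n)$,

$$\|A - X\| \le 8C\delta.$$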

Let $S = \sum_{jk} |a_{jk}|$; then, appealing to Lemma 1 in [AHK06], we find that $X$ has $O(\sqrt{n}\,S/\delta)$ nonzero entries with probability at least $1 - \exp(-\Omega(\sqrt{n}\,S/\delta))$.

Arora, Hazan, and Kale establish that this scheme guarantees $\|A - X\| \le O(\delta)$ with probability at least $1 - \exp(-\Omega(n))$, so we see that our general bound recovers a bound of the same order.
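A direct simulation of this scheme makes its two regimes concrete. The sketch below is an illustration under assumed sizes and an assumed Gaussian test matrix: entries above the threshold $\delta/\sqrt{n}$ are kept exactly, smaller entries are rounded to $0$ or $\pm\delta/\sqrt{n}$, and the observed spectral error is compared with $4\delta$ as a reference scale, since $C$ is unknown.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, delta = 40, 60, 1.0                       # illustrative values (assumed)
A = rng.normal(scale=0.1, size=(m, n))

thresh = delta / np.sqrt(n)
small = np.abs(A) <= thresh                     # entries that get quantized
keep_prob = np.where(small, np.abs(A) / thresh, 1.0)
fired = rng.random(A.shape) < keep_prob         # Bernoulli draws for the small entries

# Small entries become 0 or sgn(a) * thresh; large entries are left untouched.
X = np.where(small, np.sign(A) * thresh * fired, A)

# Variance of a scaled Bernoulli: thresh^2 * q * (1 - q), zero for large entries.
var = np.where(small, thresh**2 * keep_prob * (1.0 - keep_prob), 0.0)
err = np.linalg.norm(A - X, 2)
print(f"spectral error = {err:.3f}, nonzeros = {np.count_nonzero(X)} of {A.size}")
```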

In conclusion, we see that the bound on expected spectral error in Theorem 5, in conjunction with the deviation result in Corollary 4, provides guarantees comparable to those derived with scheme-specific analyses. We anticipate that the flexibility demonstrated here will make these useful tools for analyzing and guiding the design of novel sparsification schemes.